Credit risk analysis plays a crucial role in the financial industry, enabling lenders to assess the creditworthiness of potential borrowers and make informed decisions about lending. With the increasing availability of data and advancements in machine learning techniques, credit risk analysis has seen significant improvements in accuracy and efficiency.
In this Jupyter Notebook, we will explore the process of credit risk analysis using real-world credit data. Our goal is to build a predictive model that classifies borrowers as risky or not risky, helping financial institutions minimize losses and maximize profitability.
The dataset used in this analysis contains information about various borrowers, including their age, income, loan intent, loan amount, and previous credit history. Additionally, it includes the loan grade, which indicates the level of risk associated with each loan application (ranging from "A" for low risk to "G" for high risk), among other features.
| feature | description |
|---|---|
| person_age | The person's age in years. |
| person_income | The person's annual income. |
| person_home_ownership | The type of home ownership (RENT, OWN, MORTGAGE, OTHER). |
| person_emp_length | The person's employment length in years. |
| loan_intent | The person's intent for the loan (PERSONAL, EDUCATION, MEDICAL, VENTURE, HOMEIMPROVEMENT, DEBTCONSOLIDATION). |
| loan_grade | The level of risk on the loan (A to G; A is least risky, G is most risky). |
| loan_amnt | The loan amount. |
| loan_int_rate | The loan interest rate (between 6% and 21%). |
| loan_status | Whether the loan is currently in default (1 = default, 0 = non-default). |
| loan_percent_income | The percentage of the person's income dedicated to the loan payment. |
| cb_person_default_on_file | Whether the person has a default history (Y, N). |
| cb_person_cred_hist_length | The person's credit history length in years. |
1. Exploratory Data Analysis (EDA): Through EDA, we will gain insights into the distribution of various features, explore correlations, and identify potential patterns or trends.
2. Data Preprocessing: We will clean and preprocess the data to handle missing values, encode categorical variables, and prepare it for modeling.
3. Feature Selection: To build an effective credit risk model, we will select relevant features and examine their impact on the target variable.
4. Model Building: Using machine learning algorithms such as XGBoost, Random Forest, and Logistic Regression, we will train predictive models to classify borrowers as low-risk or high-risk.
5. Hyperparameter Tuning: Fine-tuning each model's hyperparameters will help optimize its performance and produce more accurate predictions.
6. Model Evaluation: We will evaluate the performance of each model using appropriate metrics, such as accuracy, precision, recall, and F1 score.
7. Credit Risk Prediction: Using the selected model, we will predict the credit risk of new loan applicants and classify them into appropriate risk categories.
8. Conclusion: Finally, we will summarize our findings, discuss the model's effectiveness, and provide recommendations for future improvements.
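The workflow above, from preprocessing through training to evaluation, can be sketched end-to-end with scikit-learn's Pipeline API. This is a minimal illustration on synthetic data (via `make_classification`, mimicking the class imbalance of `loan_status`), not the exact configuration used later in the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic stand-in for the credit data (~22% positives, like loan_status)
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.78, 0.22], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Preprocessing and model in one object, so the test set never leaks into fitting
clf = Pipeline([('scaler', StandardScaler()),
                ('model', LogisticRegression(max_iter=1000))])
clf.fit(X_train, y_train)
print(f"F1 on held-out data: {f1_score(y_test, clf.predict(X_test)):.3f}")
```

Bundling the scaler and classifier in one Pipeline means a single `fit` call learns everything from the training split only, which is the discipline the later preprocessing sections enforce by hand.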
## Basic Libraries:
import pandas as pd
pd.options.display.max_colwidth=150 ## this is used to set the column width.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
import joblib
warnings.filterwarnings("ignore")
%matplotlib inline
## For making sample data:
from sklearn.datasets import make_classification
## For Preprocessing:
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score, RepeatedKFold,RepeatedStratifiedKFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.metrics import f1_score
from sklearn.metrics import mean_squared_error
# from sklearn.base import TransformerMixin,BaseEstimator
## Using imblearn library:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
## Using msno Library for Missing Value analysis:
import missingno as msno
## For Metrics:
from sklearn.metrics import PrecisionRecallDisplay, accuracy_score
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, classification_report
from sklearn.model_selection import learning_curve
## For Machine Learning Models:
from sklearn.linear_model import LogisticRegression,LinearRegression
from sklearn.neighbors import KNeighborsClassifier,KNeighborsRegressor
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier,RandomForestRegressor
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
import pickle
## Setting the seed to allow reproducibility
np.random.seed(31415)
df = pd.read_csv("./credit_risk_dataset.csv")
df.head(10)
| | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 59000 | RENT | 123.0 | PERSONAL | D | 35000 | 16.02 | 1 | 0.59 | Y | 3 |
| 1 | 21 | 9600 | OWN | 5.0 | EDUCATION | B | 1000 | 11.14 | 0 | 0.10 | N | 2 |
| 2 | 25 | 9600 | MORTGAGE | 1.0 | MEDICAL | C | 5500 | 12.87 | 1 | 0.57 | N | 3 |
| 3 | 23 | 65500 | RENT | 4.0 | MEDICAL | C | 35000 | 15.23 | 1 | 0.53 | N | 2 |
| 4 | 24 | 54400 | RENT | 8.0 | MEDICAL | C | 35000 | 14.27 | 1 | 0.55 | Y | 4 |
| 5 | 21 | 9900 | OWN | 2.0 | VENTURE | A | 2500 | 7.14 | 1 | 0.25 | N | 2 |
| 6 | 26 | 77100 | RENT | 8.0 | EDUCATION | B | 35000 | 12.42 | 1 | 0.45 | N | 3 |
| 7 | 24 | 78956 | RENT | 5.0 | MEDICAL | B | 35000 | 11.11 | 1 | 0.44 | N | 4 |
| 8 | 24 | 83000 | RENT | 8.0 | PERSONAL | A | 35000 | 8.90 | 1 | 0.42 | N | 2 |
| 9 | 21 | 10000 | OWN | 6.0 | VENTURE | D | 1600 | 14.74 | 1 | 0.16 | N | 3 |
df.shape[0],df.shape[1]
(32581, 12)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32581 entries, 0 to 32580
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   person_age                  32581 non-null  int64
 1   person_income               32581 non-null  int64
 2   person_home_ownership       32581 non-null  object
 3   person_emp_length           31686 non-null  float64
 4   loan_intent                 32581 non-null  object
 5   loan_grade                  32581 non-null  object
 6   loan_amnt                   32581 non-null  int64
 7   loan_int_rate               29465 non-null  float64
 8   loan_status                 32581 non-null  int64
 9   loan_percent_income         32581 non-null  float64
 10  cb_person_default_on_file   32581 non-null  object
 11  cb_person_cred_hist_length  32581 non-null  int64
dtypes: float64(3), int64(5), object(4)
memory usage: 3.0+ MB
df.describe()
| | person_age | person_income | person_emp_length | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|
| count | 32581.000000 | 3.258100e+04 | 31686.000000 | 32581.000000 | 29465.000000 | 32581.000000 | 32581.000000 | 32581.000000 |
| mean | 27.734600 | 6.607485e+04 | 4.789686 | 9589.371106 | 11.011695 | 0.218164 | 0.170203 | 5.804211 |
| std | 6.348078 | 6.198312e+04 | 4.142630 | 6322.086646 | 3.240459 | 0.413006 | 0.106782 | 4.055001 |
| min | 20.000000 | 4.000000e+03 | 0.000000 | 500.000000 | 5.420000 | 0.000000 | 0.000000 | 2.000000 |
| 25% | 23.000000 | 3.850000e+04 | 2.000000 | 5000.000000 | 7.900000 | 0.000000 | 0.090000 | 3.000000 |
| 50% | 26.000000 | 5.500000e+04 | 4.000000 | 8000.000000 | 10.990000 | 0.000000 | 0.150000 | 4.000000 |
| 75% | 30.000000 | 7.920000e+04 | 7.000000 | 12200.000000 | 13.470000 | 0.000000 | 0.230000 | 8.000000 |
| max | 144.000000 | 6.000000e+06 | 123.000000 | 35000.000000 | 23.220000 | 1.000000 | 0.830000 | 30.000000 |
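The summary statistics already hint at data-quality problems: a maximum `person_age` of 144 and a `person_emp_length` of 123 years are not plausible. A quick sanity check can flag such rows before modeling (a sketch on a tiny stand-in frame; the thresholds are judgment calls, not values from the source):

```python
import pandas as pd

# Tiny stand-in with the same columns as the credit dataset
df = pd.DataFrame({
    'person_age': [22, 144, 25],
    'person_emp_length': [123.0, 5.0, 1.0],
})

# Flag physically implausible values rather than silently keeping them:
# nobody is older than ~100, and nobody works longer than they have lived
suspect = df[(df['person_age'] > 100) |
             (df['person_emp_length'] > df['person_age'])]
print(suspect)
```

The outlier-removal section after the train/test split applies exactly this kind of threshold filtering to the real data.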
## Checking for Duplicates
dups = df.duplicated()
dups.value_counts() #There are 165 Duplicated rows
False    32416
True       165
dtype: int64
## Removing the Duplicates
df.drop_duplicates(inplace=True)
df.drop(['loan_int_rate'],axis=1,inplace=True)
ccol=df.select_dtypes(include=["object"]).columns
ncol=df.select_dtypes(include=["int","float"]).columns
print("The number of Categorical columns are:",len(ccol))
print("The number of Numerical columns are:",len(ncol))
The number of Categorical columns are: 4
The number of Numerical columns are: 7
print("The NUMERICAL columns are:\n")
for i in ncol:
print("->",i,"-",df[i].nunique())
print("\n---------------------------\n")
print("The CATEGORICAL columns are:\n")
for i in ccol:
print("->",i,"-",df[i].nunique())
The NUMERICAL columns are:

-> person_age - 58
-> person_income - 4295
-> person_emp_length - 36
-> loan_amnt - 753
-> loan_status - 2
-> loan_percent_income - 77
-> cb_person_cred_hist_length - 29

---------------------------

The CATEGORICAL columns are:

-> person_home_ownership - 4
-> loan_intent - 6
-> loan_grade - 7
-> cb_person_default_on_file - 2
for col in ncol:
min_value = df[col].min()
max_value = df[col].max()
print(f'Range for {col} : [{min_value} to {max_value}]')
Range for person_age : [20 to 144]
Range for person_income : [4000 to 6000000]
Range for person_emp_length : [0.0 to 123.0]
Range for loan_amnt : [500 to 35000]
Range for loan_status : [0 to 1]
Range for loan_percent_income : [0.0 to 0.83]
Range for cb_person_cred_hist_length : [2 to 30]
plt.figure(figsize=(10,7))
for index, col in enumerate(ccol):
plt.subplot(2,3, index+1)
sns.countplot(x=col, hue='loan_status', data=df, palette='Blues')
plt.xticks(rotation=90)
plt.tight_layout()
# Individual frequency plot
plt.figure(figsize=(10,7))
for index, col in enumerate(ccol):
plt.subplot(2,3, index+1)
sns.countplot(x=col, palette='Blues', data= df)
plt.xticks(rotation=90)
plt.tight_layout()
loan_intent_counts = df['loan_intent'].value_counts()
# Create the pie chart using Plotly
fig = px.pie(loan_intent_counts, names=loan_intent_counts.index, values=loan_intent_counts.values,
title='Pie Chart of Loan Intent', color_discrete_sequence=px.colors.sequential.Viridis)
# Show the plot
fig.show()
mean_income_by_ownership = df.groupby('person_home_ownership')['person_income'].mean().reset_index()
# Create the bar plot using Plotly
fig = px.bar(mean_income_by_ownership, x='person_home_ownership', y='person_income',
title='Mean Person Income by Home Ownership', color='person_home_ownership',
color_discrete_sequence=px.colors.sequential.Viridis)
# Show the plot
fig.show()
plt.figure(figsize=(8, 6))
df['person_age'].plot.hist(bins=10, color='skyblue', edgecolor='black')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Distribution of Person Age')
plt.show()
plt.figure(figsize=(8, 6))
df.boxplot(column='person_income', vert=False)
plt.xlabel('Income')
plt.title('Boxplot of Person Income')
plt.show()
All person_income values above 1.5M are outliers.
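That outlier claim is easy to quantify directly. A sketch on toy values (the 1.5M cutoff is the one suggested above; the incomes here are illustrative, not the real data):

```python
import pandas as pd

# Illustrative incomes, including two extreme values
income = pd.Series([59000, 9600, 6_000_000, 55000, 2_100_000])

cutoff = 1_500_000
outliers = income[income > cutoff]
print(f"{len(outliers)} of {len(income)} incomes exceed {cutoff:,}")
```

On the real frame the same one-liner (`df.loc[df['person_income'] > 1_500_000]`) lists the rows the boxplot is pointing at.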
# Generate the correlation matrix (numeric columns only)
correlation_matrix = df.corr(numeric_only=True)
# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
# Add title
plt.title('Correlation Heatmap')
# Show the plot
plt.show()
person_age -> cb_person_cred_hist_length: A strong positive correlation between a person's age and the length of their credit history suggests that older people tend to have longer credit histories. This is generally expected, since older people have had more time to build a credit history.
loan_amnt -> loan_percent_income: The strong positive correlation between the loan amount and the percentage of income dedicated to the loan suggests that granted loan amounts generally grow as the share of income devoted to repayment grows. This may indicate that lenders grant larger loans to those who dedicate a larger share of their income to repayment.
loan_amnt -> person_income: The strong positive correlation between the loan amount and the person's income indicates that people with higher incomes tend to obtain larger loans. This is generally expected, since a higher income can be associated with a greater repayment capacity.
loan_status -> loan_percent_income: The strong positive correlation between the loan status and the percentage of income dedicated to the loan suggests that loans consuming a larger share of income may be more likely to default.
loan_status -> loan_int_rate: The strong positive correlation between the loan status and the interest rate indicates that loans with higher interest rates may be more likely to default.
person_income -> loan_percent_income: The strong negative correlation between the person's income and the percentage of income dedicated to the loan indicates that people with higher incomes generally dedicate a smaller share of their income to loan repayment.
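Rather than reading pairs off the heatmap by eye, the strongest correlations can be listed programmatically. A sketch on a small synthetic frame (column names match the dataset; the values are generated, with credit-history length deliberately tracking age as discussed above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = rng.integers(20, 60, 500)
# Credit history length roughly tracks age, mirroring the relationship above
hist = age - 18 + rng.normal(0, 2, 500)
df = pd.DataFrame({'person_age': age,
                   'cb_person_cred_hist_length': hist,
                   'loan_amnt': rng.integers(500, 35000, 500)})

corr = df.corr(numeric_only=True).abs()
# Keep the upper triangle only, so each pair appears exactly once
pairs = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)).stack()
print(pairs.sort_values(ascending=False).head(3))
```

On the real frame this ranking surfaces the same pairs the commentary above walks through.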
# Create the scatter plot using Plotly
fig = px.scatter(df, x='person_age', y='person_income', title='Scatter Plot of Age vs. Income', color='person_income',
color_continuous_scale=px.colors.sequential.Viridis)
# Show the plot
fig.show()
# Calculate the sum of 'person_income' for each category of 'person_home_ownership'
income_by_ownership = df.groupby('person_home_ownership')['person_income'].sum().reset_index()
# Get the list of categories and the total income for each category
categories = income_by_ownership['person_home_ownership']
total_income = income_by_ownership['person_income']
# Create the stacked bar plot using Pyplot
plt.figure(figsize=(10, 6))
plt.bar(categories, total_income, color='skyblue')
# Add labels and title
plt.xlabel('Home Ownership')
plt.ylabel('Total Income')
plt.title('Total Income by Home Ownership')
# Show the plot
plt.show()
# Scatter plot: person_age vs. cb_person_cred_hist_length
plt.figure(figsize=(8, 6))
plt.scatter(df['person_age'], df['cb_person_cred_hist_length'], marker='o', color='skyblue')
plt.xlabel('Person Age')
plt.ylabel('Credit History Length')
plt.title('Scatter Plot: Person Age vs. Credit History Length')
plt.show()
# Scatter plot: loan_amount vs. loan_percent_income
plt.figure(figsize=(8, 6))
plt.scatter(df['loan_amnt'], df['loan_percent_income'], marker='o', color='green')
plt.xlabel('Loan Amount')
plt.ylabel('Loan Percent Income')
plt.title('Scatter Plot: Loan Amount vs. Loan Percent Income')
plt.show()
# Scatter plot: loan_amount vs. person_income
plt.figure(figsize=(8, 6))
plt.scatter(df['loan_amnt'], df['person_income'], marker='o', color='orange')
plt.xlabel('Loan Amount')
plt.ylabel('Person Income')
plt.title('Scatter Plot: Loan Amount vs. Person Income')
plt.show()
numerical_columns = ncol
# Create histograms for each numerical column
plt.figure(figsize=(12, 8))
for i, col in enumerate(numerical_columns[:6], 1): # Limit to 6 columns to fit in the grid
plt.subplot(2, 3, i)
plt.hist(df[col], bins=20, edgecolor='black')
plt.xlabel(col)
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
# Box plot: loan_status vs. loan_percent_income
plt.figure(figsize=(8, 6))
plt.boxplot([df[df['loan_status'] == 0]['loan_percent_income'],
df[df['loan_status'] == 1]['loan_percent_income']],
labels=['Paid', 'Default'], showfliers=False, notch=True, patch_artist=True)
plt.xlabel('Loan Status')
plt.ylabel('Loan Percent Income')
plt.title('Box Plot: Loan Status vs. Loan Percent Income')
plt.show()
sns.countplot(x=df['loan_status'], palette='Oranges')
plt.title('Distribution of Risk')
plt.show()
df['loan_status'].value_counts().plot(kind='pie', autopct='%1.2f%%', explode=[0,0.1], shadow=True)
<AxesSubplot:ylabel='loan_status'>
The Data is highly IMBALANCED. We will deal with this using an oversampling technique, SMOTE, which synthesizes new minority-class samples by interpolating between existing ones and their nearest neighbours.
Missing data, or missing values, occur when you don’t have data stored for certain variables or participants. Data can go missing due to incomplete data entry, equipment malfunctions, lost files, and many other reasons.
There are typically 3 types of missing values:
Missing completely at random (MCAR)
Missing at random (MAR)
Missing not at random (MNAR)
Problems: Missing data are problematic because, depending on the type, they can sometimes cause sampling bias. This means your results may not be generalizable outside of your study because your data come from an unrepresentative sample.
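One practical way to distinguish MCAR from MAR is to check whether the missing rate in one column varies with the value of another column; if it does, the data are closer to MAR. A rough sketch of that comparison (illustrative toy data with a built-in MAR mechanism, not the real dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'loan_grade': rng.choice(list('ABC'), 300),
                   'loan_int_rate': rng.uniform(6, 21, 300)})
# Make the rate go missing more often for grade 'C' (a MAR mechanism)
df.loc[(df['loan_grade'] == 'C') & (rng.random(300) < 0.5),
       'loan_int_rate'] = np.nan

# Compare missing rates across groups; a large gap suggests MAR, not MCAR
print(df['loan_int_rate'].isna().groupby(df['loan_grade']).mean())
```

If the per-group missing rates were all roughly equal, MCAR would remain plausible and simple imputation would be safer.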
df.isnull().any()
person_age                    False
person_income                 False
person_home_ownership         False
person_emp_length              True
loan_intent                   False
loan_grade                    False
loan_amnt                     False
loan_status                   False
loan_percent_income           False
cb_person_default_on_file     False
cb_person_cred_hist_length    False
dtype: bool
df.isna().sum()
person_age                      0
person_income                   0
person_home_ownership           0
person_emp_length             887
loan_intent                     0
loan_grade                      0
loan_amnt                       0
loan_status                     0
loan_percent_income             0
cb_person_default_on_file       0
cb_person_cred_hist_length      0
dtype: int64
msno.bar(df)
<AxesSubplot:>
NOTE: EVERY PREPROCESSING TECHNIQUE IS DONE ONLY ON THE TRAIN SET. SO SPLITTING IS MANDATORY BEFORE OUTLIER REMOVAL, MISSING VALUES HANDLING, OVERSAMPLING, ETC...
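The rule in the note reduces to: call `fit`/`fit_transform` on the training split only, and plain `transform` on the test split, so no test-set statistics leak into preprocessing. A minimal sketch with a single scaler:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_s = scaler.transform(X_test)        # reuse them; never refit on test
```

Calling `fit_transform` on the test set instead would silently recompute the statistics from test data, which is exactly the leakage this note warns against.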
# we split the data to train / test parts
X_train, X_test, y_train, y_test = train_test_split(df.drop('loan_status', axis=1), df['loan_status'],
random_state=0, test_size=0.2, stratify=df['loan_status'],
shuffle=True)
#To print the number of unique values:
for col in X_train:
print(col, '--->', X_train[col].nunique())
if X_train[col].nunique()<20:
print(X_train[col].value_counts(normalize=True)*100)
print()
person_age ---> 58
person_income ---> 3680
person_home_ownership ---> 4
RENT        50.320068
MORTGAGE    41.439149
OWN          7.916859
OTHER        0.323924
Name: person_home_ownership, dtype: float64

person_emp_length ---> 36
loan_intent ---> 6
EDUCATION            19.809502
MEDICAL              18.787598
VENTURE              17.542033
PERSONAL             16.878760
DEBTCONSOLIDATION    15.968687
HOMEIMPROVEMENT      11.013420
Name: loan_intent, dtype: float64

loan_grade ---> 7
A    32.932284
B    32.126330
C    19.902052
D    11.121394
E     3.004010
F     0.732685
G     0.181243
Name: loan_grade, dtype: float64

loan_amnt ---> 710
loan_percent_income ---> 75
cb_person_default_on_file ---> 2
N    82.392411
Y    17.607589
Name: cb_person_default_on_file, dtype: float64

cb_person_cred_hist_length ---> 29
X_train.loc[X_train['person_age']>=80, :]
X_train = X_train.loc[X_train['person_age']<=80, :]
X_train.loc[X_train['person_emp_length']>=60, :]
X_train = X_train.loc[X_train['person_emp_length']<60, :]
X_train.loc[X_train['person_income']>=2000000, :]
X_train = X_train.loc[X_train['person_income']<=2000000, :]
y_train = y_train[X_train.index]
y_train.shape
(25196,)
For numerical features:
1- Iterative imputer - to handle missing values
2- Scaling - to keep features on a comparable scale
For categorical features:
1- One Hot Encoder - to encode each category for model interpretability
#Create the main pipeline for preprocessing numerical variables:
numerical_pipeline = Pipeline([
('imputer', IterativeImputer()), # Impute missing values using iterative imputer
('scaler', StandardScaler()) # Scale numerical features
])
#Create the pipeline for preprocessing categorical variables:
categorical_pipeline = Pipeline([
('encoder', OneHotEncoder(handle_unknown='ignore')) # One-hot encode; ignore categories unseen during fit
])
# Replace 'numerical_features' and 'categorical_features' with lists of your numerical and categorical feature names
numerical_features = X_train.select_dtypes(include='number').columns.tolist()
categorical_features = X_train.select_dtypes(include='object').columns.tolist()
preprocessor = ColumnTransformer([
('numerical', numerical_pipeline, numerical_features),
('categorical', categorical_pipeline, categorical_features)
])
#Fit and transform the main pipeline on the training data:
X_train_preprocessed = preprocessor.fit_transform(X_train)
# Helper to refit the preprocessing pipeline on fresh training data
def fit_preprocessing_pipeline(X_train):
    return preprocessor.fit(X_train)
joblib.dump(preprocessor, 'preprocessing_pipeline.pkl')
['preprocessing_pipeline.pkl']
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_preprocessed, y_train)
# Replace numeric class labels with words
class_labels_mapping = {0: 'paid', 1: 'default'}
y_train_mapped = y_train.map(class_labels_mapping)
y_train_balanced_mapped = pd.Series(y_train_balanced).map(class_labels_mapping)
# Create bar plot for class distribution before SMOTE with words
plt.figure(figsize=(6, 4))
y_train_mapped.value_counts().plot(kind='bar')
plt.xlabel('loan_status')
plt.ylabel('Count')
plt.title('Class Distribution Before SMOTE')
plt.xticks(rotation=0)
plt.show()
# Create bar plot for class distribution after SMOTE with words
plt.figure(figsize=(6, 4))
y_train_balanced_mapped.value_counts().plot(kind='bar')
plt.xlabel('loan_status')
plt.ylabel('Count')
plt.title('Class Distribution After SMOTE')
plt.xticks(rotation=0)
plt.show()
X_test_processed = preprocessor.transform(X_test)  # transform only; the preprocessor was already fitted on the training set
# Define the models and their respective hyperparameter grids
models = {
'XGBoost': (XGBClassifier(), {'n_estimators': [100, 200, 300], 'max_depth': [6,8,10], 'learning_rate': [0.01, 0.05, 0.1]}),
'Logistic Regression': (LogisticRegression(), {'C': [0.01, 0.1, 1, 10]}),
#'SVM': (SVC(), {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}),
'Neural Network': (MLPClassifier(), {'hidden_layer_sizes': [(100,), (100, 50)], 'activation': ['relu', 'tanh']}),
'Random Forest': (RandomForestClassifier(random_state=0, class_weight='balanced'), {'n_estimators': [100, 200, 300], 'max_depth': [None, 5, 10]}),
}
# Define a dictionary to store the evaluation metrics for each model
evaluation_metrics = {
'Model': [],
'Cross-Val Score': [],
'Accuracy': [],
'F1 Score': [],
'MSRE': []
}
# Create a dictionary to store the best models
best_models = {}
# Perform cross-validation and hyperparameter tuning for each model
for model_name, (model, param_grid) in models.items():
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train_balanced, y_train_balanced)
print(f"Model: {model_name}")
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.3f}\n")
# Append the evaluation metrics to the dictionary
evaluation_metrics['Model'].append(model_name)
evaluation_metrics['Cross-Val Score'].append(grid_search.best_score_)
# Get the best model
best_model = grid_search.best_estimator_
# Store the best model in the dictionary
best_models[model_name] = best_model
# Predict the test set using the best model
y_pred = best_model.predict(X_test_processed)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
evaluation_metrics['Accuracy'].append(accuracy)
# Calculate F1 score
f1 = f1_score(y_test, y_pred, average='weighted')
evaluation_metrics['F1 Score'].append(f1)
# Calculate the mean squared error (for 0/1 labels this equals the misclassification rate)
msre = mean_squared_error(y_test, y_pred)
evaluation_metrics['MSRE'].append(msre)
# Convert the dictionary to a Pandas DataFrame for easy plotting
metrics_df = pd.DataFrame(evaluation_metrics)
# Plot the evaluation metrics
plt.figure(figsize=(10, 6))
plt.bar(metrics_df['Model'], metrics_df['Cross-Val Score'], label='Cross-Val Score', alpha=0.7)
plt.bar(metrics_df['Model'], metrics_df['Accuracy'], label='Accuracy', alpha=0.7)
plt.bar(metrics_df['Model'], metrics_df['F1 Score'], label='F1 Score', alpha=0.7)
plt.bar(metrics_df['Model'], metrics_df['MSRE'], label='MSRE', alpha=0.7)
plt.xticks(rotation=45)
plt.xlabel('Model')
plt.ylabel('Score')
plt.title('Model Evaluation Metrics')
plt.legend()
plt.tight_layout()
plt.show()
# After the loop, training is complete
print("Training completed!")
Model: XGBoost
Best parameters: {'learning_rate': 0.1, 'max_depth': 8, 'n_estimators': 300}
Best cross-validation score: 0.955
Model: Logistic Regression
Best parameters: {'C': 10}
Best cross-validation score: 0.801
Model: Neural Network
Best parameters: {'activation': 'tanh', 'hidden_layer_sizes': (100, 50)}
Best cross-validation score: 0.908
Model: Random Forest
Best parameters: {'max_depth': None, 'n_estimators': 300}
Best cross-validation score: 0.948
Training completed!
evaluation_metrics = pd.DataFrame(evaluation_metrics)
metrics_to_plot = ['Cross-Val Score', 'Accuracy', 'F1 Score', 'MSRE']
for metric in metrics_to_plot:
plt.figure(figsize=(8, 6))
plt.bar(evaluation_metrics['Model'], evaluation_metrics[metric], alpha=0.7)
plt.xticks(rotation=45)
plt.xlabel('Model')
plt.ylabel('Score')
plt.title(f'Model Evaluation Metric: {metric}')
plt.tight_layout()
plt.show()
models = {
'Logistic Regression': LogisticRegression(),
'Random Forest': RandomForestClassifier(random_state=0, class_weight='balanced'),
'Neural Network': MLPClassifier()
}
# Create a function to plot the learning curve
def plot_learning_curve(model, X, y):
train_sizes, train_scores, test_scores = learning_curve(model, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring='accuracy')
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
plt.figure(figsize=(8, 6))
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color='blue')
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.1, color='orange')
plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training Score')
plt.plot(train_sizes, test_mean, 'o-', color='orange', label='Cross-Validation Score')
plt.xlabel('Training Examples')
plt.ylabel('Score')
plt.title(f'Learning Curve for {model.__class__.__name__}')
plt.legend()
plt.grid(True)
plt.show()
# Loop over the models and plot the learning curve for each
for model_name, model in models.items():
plot_learning_curve(model, X_train_balanced, y_train_balanced)
# Create a dictionary to store confusion matrices for each model
conf_matrices = {}
# Loop over the models and calculate the confusion matrix for each
for model_name, model in best_models.items():
y_pred = model.predict(X_test_processed)
conf_matrix = confusion_matrix(y_test, y_pred)
conf_matrices[model_name] = conf_matrix
# Plot the confusion matrices
plt.figure(figsize=(12, 8))
for i, (model_name, conf_matrix) in enumerate(conf_matrices.items()):
plt.subplot(2, 2, i + 1)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False, square=True)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title(f'Confusion Matrix - {model_name}')
plt.tight_layout()
plt.show()
# Save the best-performing model (XGBoost achieved the highest cross-validation score):
joblib.dump(best_models['XGBoost'], 'best_model.pkl')
['best_model.pkl']
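With the preprocessing pipeline and model both serialized, scoring a new applicant becomes a load-transform-predict sequence. A self-contained sketch (it trains and saves a toy stand-in pipeline first; the column names and the `best_model.pkl` file name mirror the ones used above, but the data and model here are illustrative):

```python
import joblib
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the fitted preprocessor + model saved above
pipe = Pipeline([('scaler', StandardScaler()),
                 ('model', LogisticRegression(max_iter=1000))])
train = pd.DataFrame({'loan_amnt': [1000, 35000, 5000, 30000],
                      'person_income': [90000, 20000, 80000, 25000]})
pipe.fit(train, [0, 1, 0, 1])
joblib.dump(pipe, 'best_model.pkl')

# At prediction time: load once, then score incoming applicants
loaded = joblib.load('best_model.pkl')
applicant = pd.DataFrame({'loan_amnt': [34000], 'person_income': [21000]})
print('default risk' if loaded.predict(applicant)[0] == 1 else 'low risk')
```

This is the pattern a Streamlit front end would use: load the artifacts at startup, then call `predict` on each submitted form.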
In this project I was exposed to many concepts, including:
--> building a preprocessing and modeling pipeline
--> hyperparameter tuning
--> evaluating models
--> building my first Streamlit application
--> deploying it
In conclusion, this credit risk analysis project demonstrates the power of data science and machine learning in the financial industry. This web app can serve as a valuable tool for financial institutions to assess credit risk, make informed lending decisions, and mitigate potential losses.
However, as with any data science project, there are a few points to keep in mind:
Continuous monitoring: Credit risk is a dynamic domain, and models need regular updates to adapt to changing economic conditions and borrower behaviors.
Model Robustness: Although the achieved accuracy is excellent, it's essential to test the model's robustness on a wider range of scenarios and data distributions.
Ethical Considerations: Credit risk models must be fair and unbiased. Continuously monitor for any potential bias and ensure fairness in lending decisions.
Model Deployment: Deploying a machine learning model in production involves careful considerations, such as scalability, security, and version control.